Linear Regression

Randy Johnson

6/21/2018

Overview

Least Squares Intuition
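
The least squares idea can be sketched numerically: the fitted line is the one that minimizes the sum of squared vertical residuals, and for simple regression that minimizer has a closed form which lm() reproduces. A base-R sketch on made-up data (not from the slides):

```r
# Least squares picks the line minimizing sum((y - b0 - b1*x)^2).
# For simple regression the minimizer has a closed form, which lm() matches.
set.seed(42)
x <- rnorm(50)
y <- 2 + 3 * x + rnorm(50)

b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # slope
b0 <- mean(y) - b1 * mean(x)                                     # intercept

fit <- lm(y ~ x)
all.equal(unname(coef(fit)), c(b0, b1))  # TRUE
```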

Linear Relationship

# some made up data
library(dplyr)  # provides data_frame() (superseded by tibble() in newer dplyr)
linExample <- data_frame(x1 = rnorm(100),
                         x2 = rnorm(100),
                         y = x1 + x2^2 + rnorm(100))

Linear Relationship - Component Residual Plots
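
Component+residual plots for this check can be produced with car::crPlots(); a minimal sketch, with the data regenerated here so the chunk stands alone:

```r
library(car)   # crPlots()
set.seed(42)
d <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
d$y <- d$x1 + d$x2^2 + rnorm(100)

fit <- lm(y ~ x1 + x2, data = d)
crPlots(fit)   # the x2 panel bends away from the straight line, flagging nonlinearity
```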

Multivariate Normality

mvnExample <- data_frame(x1 = rnorm(100),
                         y1 = x1 + rnorm(100), # normal error terms
                         y2 = x1 + rt(100, 3)) # non-normal (heavy-tailed) error terms
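
The Shapiro-Wilk results on the next two slides test the standardized residuals of each fit; a sketch of the calls (the data are rebuilt here with base R and a different seed, so the exact statistics will differ from the slides):

```r
set.seed(42)
mvn <- data.frame(x1 = rnorm(100))
mvn$y1 <- mvn$x1 + rnorm(100)   # normal error terms
mvn$y2 <- mvn$x1 + rt(100, 3)   # heavy-tailed error terms

# test the standardized residuals of each fit
p_good <- shapiro.test(rstandard(lm(y1 ~ x1, data = mvn)))$p.value  # typically large
p_bad  <- shapiro.test(rstandard(lm(y2 ~ x1, data = mvn)))$p.value  # typically small
```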

Multivariate Normality - Diagnostics (good)

## 
##  Shapiro-Wilk normality test
## 
## data:  .std.resid
## W = 0.98544, p-value = 0.3417

Multivariate Normality - Diagnostics (bad)

## 
##  Shapiro-Wilk normality test
## 
## data:  .std.resid
## W = 0.91781, p-value = 1.079e-05

Multivariate Normality - Model comparison

Model 1 has normal error terms.

## 
## Call:
## lm(formula = y1 ~ x1, data = mvnExample)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.7996 -0.6230  0.1467  0.7049  1.9721 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.15329    0.09750  -1.572    0.119    
## x1           0.91487    0.09161   9.987   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9619 on 98 degrees of freedom
## Multiple R-squared:  0.5044, Adjusted R-squared:  0.4993 
## F-statistic: 99.74 on 1 and 98 DF,  p-value: < 2.2e-16

Multivariate Normality - Model comparison

Model 2 has non-normal error terms.

## 
## Call:
## lm(formula = y2 ~ x1, data = mvnExample)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.4554 -1.1273 -0.1941  1.0091  8.7573 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.1716     0.1932   0.888    0.377    
## x1            0.8026     0.1815   4.421 2.54e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.906 on 98 degrees of freedom
## Multiple R-squared:  0.1663, Adjusted R-squared:  0.1578 
## F-statistic: 19.55 on 1 and 98 DF,  p-value: 2.539e-05

No/Little Multicollinearity

mcol <- data_frame(x1 = rnorm(100),
                   x2 = x1 + rnorm(100), # x1 and x2 are correlated
                   x3 = rnorm(100),
                   y = x1 + x2 + x3 + rnorm(100))

# three models
model1 <- lm(y ~ x1 + x2 + x3, data = mcol) # everything
model2 <- lm(y ~ x1 +      x3, data = mcol) # without x2
model3 <- lm(y ~      x2 + x3, data = mcol) # without x1

No/Little Multicollinearity - Variance Inflation Factors

# check for multicollinearity
vif(model1)
##       x1       x2       x3 
## 2.271759 2.271338 1.005520
vif(model2)
##       x1       x3 
## 1.000791 1.000791
vif(model3)
##       x2       x3 
## 1.000606 1.000606

No/Little Multicollinearity - Model comparison

model     x1     x2     x3     R²
    1   0.94   1.02   0.98   0.87
    2   1.88          0.93   0.76
    3          1.59   1.03   0.81

Model coefficients for our three models. Notice how the coefficient for x3 barely changes, while the coefficients for x1 and x2, along with R², change quite a bit. This instability is the hallmark of multicollinearity.

No Autocorrelation

Adding a loess smoother to the residual plot highlights the periodic pattern in the data.
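
The deck's dat object is not shown; one way such serially correlated data might be generated (a hypothetical base-R sketch), with the residual autocorrelation made visible via acf():

```r
set.seed(42)
n   <- 200
x1  <- seq(0, 10, length.out = n)
y1  <- 0.2 * x1 + sin(2 * x1) + rnorm(n, sd = 0.5)  # periodic structure left in the errors
dat <- data.frame(x1, y1)

r <- resid(lm(y1 ~ x1, data = dat))
acf(r)   # significant spikes at low lags indicate autocorrelated residuals
```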

No Autocorrelation

library(car)  # durbinWatsonTest()
lm(y1 ~ x1, data = dat) %>%
    durbinWatsonTest()
##  lag Autocorrelation D-W Statistic p-value
##    1       0.1940815      1.609687   0.002
##  Alternative hypothesis: rho != 0

Homoscedasticity

y1 ~ x1

## 
## Suggested power transformation:  1.030183
## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 0.05980537    Df = 1     p = 0.8068038

Homoscedasticity

y2 ~ x1

## 
## Suggested power transformation:  0.1781482
## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 42.36333    Df = 1     p = 7.579794e-11
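
The outputs above come from car::spreadLevelPlot() (which prints the suggested power transformation) and car::ncvTest(); a sketch of the calls on a deliberately heteroscedastic fit, with the data assumed and regenerated here so the chunk stands alone:

```r
library(car)   # ncvTest(), spreadLevelPlot()
set.seed(42)
x1  <- runif(200, 1, 5)
y2  <- x1 * exp(rnorm(200, sd = 0.4))   # spread grows with the mean
fit <- lm(y2 ~ x1)

ncvTest(fit)           # small p-value: variance is not constant
spreadLevelPlot(fit)   # also prints a suggested power transformation for y
```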

Homoscedasticity

## 
## Call:
## lm(formula = y1 ~ x1, data = dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.86192 -0.55272 -0.04338  0.54027  2.54020 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  0.04454    0.20709   0.215  0.82991   
## x1           0.21653    0.06599   3.281  0.00122 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9762 on 198 degrees of freedom
## Multiple R-squared:  0.05158,    Adjusted R-squared:  0.04679 
## F-statistic: 10.77 on 1 and 198 DF,  p-value: 0.00122

Homoscedasticity

## 
## Call:
## lm(formula = y2 ~ x1, data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2229 -1.1031 -0.2710  0.8684 10.1981 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.3360     0.4025   0.835    0.405    
## x1            0.6483     0.1282   5.055 9.76e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.897 on 198 degrees of freedom
## Multiple R-squared:  0.1143, Adjusted R-squared:  0.1098 
## F-statistic: 25.56 on 1 and 198 DF,  p-value: 9.755e-07

Influential Observations
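
The outputs below come from car::outlierTest() and car::influencePlot(); a sketch on made-up data with one planted outlier (not the slides' data):

```r
library(car)   # outlierTest(), influencePlot()
set.seed(42)
x <- rnorm(50)
y <- x + rnorm(50)
y[49] <- y[49] + 8   # plant a single gross outlier
fit <- lm(y ~ x)

outlierTest(fit)     # Bonferroni-adjusted test of the largest studentized residual
influencePlot(fit)   # studentized residual vs hat value, circles sized by Cook's distance
```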

Influential Observations (good)

## 
## No Studentized residuals with Bonferonni p < 0.05
## Largest |rstudent|:
##    rstudent unadjusted p-value Bonferonni p
## 20 3.427632           0.001312     0.062977

##      StudRes        Hat      CookD
## 20 3.4276315 0.02443305 0.11925621
## 27 0.5655804 0.14003561 0.02643539

Influential Observations (bad)

##    rstudent unadjusted p-value Bonferonni p
## 49 6.307264         1.0012e-07   4.9057e-06

##     StudRes       Hat    CookD
## 49 6.307264 0.1480318 1.893596

Transformations

Many of the problems we encounter in linear regression stem from model choice.

Two most common transformations:

* Power transformations
* Log transformations

Power Transformations

Original model: y = β₀ + β₁x + ε

With power transformation: y = β₀ + β₁x² + ε

Power Transformations

Original model: y = β₀ + β₁x + ε

With power transformation: y = β₀ + β₁√x + ε
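
In R's formula syntax, arithmetic powers must be wrapped in I() so that ^ is treated as arithmetic rather than a formula operator (sqrt() works directly); a sketch of fitting both transformed models on made-up data:

```r
set.seed(42)
x <- runif(100, 0, 4)
y <- 1 + 2 * x^2 + rnorm(100)

fit_sq   <- lm(y ~ I(x^2))    # y = b0 + b1*x^2 + e; I() keeps ^ arithmetic
fit_sqrt <- lm(y ~ sqrt(x))   # y = b0 + b1*sqrt(x) + e
coef(fit_sq)                  # slope recovered near the true value of 2
```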

Log Transformations

Original model: y = β₀ + β₁x + ε

With log transformation: y = β₀ + β₁ln(x) + ε

Log Transformations

Original model: y = β₀ + β₁x + ε

With log transformation: ln(y) = β₀ + β₁x + ε
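
Transforming the response is useful when errors are multiplicative, since they become additive and roughly constant on the log scale; a sketch on made-up data:

```r
set.seed(42)
x <- runif(100, 1, 10)
y <- exp(0.3 * x) * exp(rnorm(100, sd = 0.2))  # multiplicative errors

fit <- lm(log(y) ~ x)   # ln(y) = b0 + b1*x + e
coef(fit)               # slope recovered near the true value of 0.3
```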

Assumptions Summary

Assumption                    Cause                   Consequence                   Diagnosis
Linear Relationship           Incorrect model         Inaccurate predictions        car::crPlots()
Multivariate Normality        Incorrect model,        Incorrect statistics          car::qqPlot(),
                              noisy data              (p-values / CIs)              shapiro.test(residuals)
No/Little Multicollinearity   Correlated variables    Unstable model coefficients   car::vif()
No Autocorrelation            Non-independent data    Inefficient estimators        car::durbinWatsonTest()
Homoscedasticity              Incorrect model,        Incorrect statistics          car::ncvTest(),
                              "bad" data              (p-values / CIs)              car::spreadLevelPlot()

Thanks

Statistics for Lunch Team

* Greg Alvord
* Eckart Bindewald
* Taina Immonen
* Brian Luke
* George Nelson
* Ravi Ravichandran
* Tom Schneider